Introduction to Solving Biological Problems Using R - Day 1

Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts

Last modified: 06 Sep 2016

Course Aims

Day 1 Schedule

  1. Introduction to R and its environment
  2. Data Structures
  3. Data Analysis Example
  4. Plotting in R

1. Introduction to R and its environment

What’s R?

The R-project page

http://www.r-project.org/

R screenshot

R screenshot

R in the New York Times

http://goo.gl/pww4ZO

New York Times, Jan 2009

New York Times, Jan 2009

R - “the ultimate virus”

University of Aukland alumni magazine, June 2016

University of Aukland alumni magazine, June 2016

http://www.ingenio-magazine.com/r-the-ultimate-virus/

R plotting capabilities

http://spatial.ly/2012/02/great-maps-ggplot2/

R plotting capabilities

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919 R facebook

Who uses R? Not just academics!

http://www.revolutionanalytics.com/companies-using-r

R can facilitate Reproducible Research

Sidney Harris - New York Times

Sidney Harris - New York Times

It is a hot topic at the moment

New York Times, July 2011

New York Times, July 2011

Hear the full account

Nature editorial - May 2016

Nature, May 2016

Nature, May 2016

Reality check on reproducibility

1,500 scientists lift the lid on reproducibility

Various platforms supported

Getting started

Launching R Using RStudio

To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click

RStudio screenshot

RStudio screenshot

Basic concepts in R - command line calculation

2 + 2
20/5 - sqrt(25) + 3^2
sin(pi/2)

Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…

Basic concepts in R - variables

x <- 10
x

myNumber <- 25
myNumber
sqrt(myNumber)

Basic concepts in R - variables

x + myNumber
x <- 21
x

Basic concepts in R - variables

x <- myNumber
x
myNumber <- myNumber + sqrt(16)
myNumber

Basic concepts in R - functions

sin(x)
sum(3,4,5,6)
max(3,4,5,6)
min(3,4,5,6)

Basic concepts in R - functions

seq(from = 2, to = 20, by = 4)
seq(2, 20, 4)

Basic concepts in R - vectors

x <- c(3,4,5,6)
x
x[1]
x[4]

Basic concepts in R - vectors

y <- c(2,3)
x[y]

Basic concepts in R - vectors

x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
x <- 3:12
x

Basic concepts in R - vectors

x <- seq(2, 20, 4)
x
x <- seq(2, 20, length.out=5)
x
y <- rep(3, 5)
y
y <- rep(1:3, 5)
y

Basic concepts in R - vectors

x <- 3:12
# Extract elements from x:

x[3:7]
x[seq(2, 6, 2)]
x[rep(3, 2)]

Basic concepts in R - vectors

y <- c(x, 1)
y
z <- c(x, y)
z

Basic concepts in R - vectors

x <- 3:12

x[-3]
x[-(5:7)]
x[-seq(2, 6, 2)]
x

Basic concepts in R - vectors

x[6] <- 4
x

x[3:5] <- 1
x

Remember!

Basic concepts in R - vector arithmetic

x <- 1:10
y <- x*2
y
z <- x^2
z

Basic concepts in R - vector arithmetic

y + z
x + 1:2
x + 1:3
Warning in x + 1:3: longer object length is not a
multiple of shorter object length
 [1]  2  4  6  5  7  9  8 10 12 11

Basic concepts in R - Character vectors and naming

gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
gene.names
[1] "Pax6"       "Beta-actin" "FoxP2"     
[4] "Hox9"      

Basic concepts in R - Character vectors and naming

gene.expression <- c(0, 3.2, 1.2, -2)
names(gene.expression) <- gene.names
gene.expression
      Pax6 Beta-actin      FoxP2       Hox9 
       0.0        3.2        1.2       -2.0 
names(gene.expression)

Exercise: genes and genomes

Species Genome size (Mb) Protein coding genes
Homo sapiens 3,102 20,774
Mus musculus 2,731 23,139
Drosophila melanogaster 169 13,937
Caenorhabditis elegans 100 20,532
Saccharomyces cerevisiae 12 6,692

Exercise: genes and genomes

  1. Let’s assume a coding gene has an average length of 1.5 kilobases. On average, how many base pairs of each genome is made of coding genes? Create a new vector to record this, called coding.bases.
  2. What percentage of each genome is made up of protein coding genes? Use your coding.bases and genome.size vectors to calculate this. (See earlier slides for how to do division in R.)
  3. How many times more bases are used for coding in the human genome compared to the yeast genome? (S. cerevisiae) How many times more bases are in the human genome in total compared to the yeast genome? Look up indices of your vectors to find out.

Answers to genome exercise

genome.size  <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
                  "D. melanogaster","C. elegans",
                  "S. cerevisiae")

names(genome.size)  <- species.name
names(coding.genes) <- species.name

Answers to genome exercise

  1. To calculate the number of coding bases, we need to use the same scale as we used for genome size (1.5 kilobases is 0.0015 Megabases):
coding.bases <- coding.genes*0.0015
coding.bases
     H. sapiens     M. musculus D. melanogaster 
        31.1610         34.7085         20.9055 
     C. elegans   S. cerevisiae 
        30.7980         10.0380 

Answers to genome exercise

  1. To calculate the percentage of coding bases in each genome:
coding.pc <- coding.bases/genome.size*100
coding.pc
     H. sapiens     M. musculus D. melanogaster 
       1.004545        1.270908       12.370118 
     C. elegans   S. cerevisiae 
      30.798000       83.650000 

Answers to genome exercise

  1. To compare human to yeast:
coding.bases[1]/coding.bases[5]
H. sapiens 
  3.104304 
genome.size[1]/genome.size[5]
H. sapiens 
     258.5 

Answers to genome exercise

names(coding.pc) <- NULL
coding.pc
[1]  1.004545  1.270908 12.370118 30.798000
[5] 83.650000

Documenting your analysis with RStudio

Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.

Format of an R markdown file

md-format

md-format

Getting help

?seq
example(seq)

Getting help

??mean

Interacting with the R console

R packages

Installing packages

install.packages(name.of.my.package)
source("http://bioconductor.org/biocLite.R")
biocLite("PackageName")

Example: Install packages ggplot2 and DESeq

install.packages("ggplot2")
source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq2")

Example: Load packages ggplot2 and DESeq2

library(ggplot2) # loads ggplot functions
library(DESeq2)   # loads DESeq functions
library()        # Lists all the packages 
                 # you've got installed 

2. Data structures

R is designed to handle experimental data

The patients data frame

First_Name Second_Name Full_Name Sex Age Weight Consent
Adam Jones Adam Jones Male 50 70.8 TRUE
Eve Parker Eve Parker Female 21 67.9 TRUE
John Evans John Evans Male 35 75.3 FALSE
Mary Davis Mary Davis Female 45 61.9 TRUE
Peter Baker Peter Baker Male 28 72.4 FALSE
Paul Daniels Paul Daniels Male 31 69.9 FALSE
Joanna Edwards Joanna Edwards Female 42 63.5 FALSE
Matthew Smith Matthew Smith Male 33 71.5 TRUE
David Roberts David Roberts Male 57 73.2 FALSE
Sally Wilson Sally Wilson Female 62 64.8 TRUE

Character, numeric and logical data types

age    <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 
            63.5, 71.5, 73.2, 64.8)
firstName  <- c("Adam", "Eve", "John", "Mary",
                "Peter", "Paul", "Joanna", "Matthew",
                "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
                "Baker","Daniels", "Edwards", "Smith", 
                "Roberts", "Wilson")

Character, numeric and logical data types

consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, 
             FALSE, FALSE, TRUE, FALSE, TRUE)

Character, numeric and logical data types

c(20, "a string", TRUE)
[1] "20"       "a string" "TRUE"    
 class(firstName)
 class(age)
 class(weight)
 class(consent)

Factors

sex <- c("Male", "Female", "Male", "Female", "Male",
         "Male", "Female", "Male", "Male", "Female")
sex
 [1] "Male"   "Female" "Male"   "Female" "Male"  
 [6] "Male"   "Female" "Male"   "Male"   "Female"

Factors

factor(sex)
 [1] Male   Female Male   Female Male   Male  
 [7] Female Male   Male   Female
Levels: Female Male

Creating a data frame (first attempt)

patients <- data.frame(firstName, secondName, 
                       paste(firstName, secondName),  
                       sex, age, weight, consent)
patients
  firstName secondName paste.firstName..secondName. ...
1      Adam      Jones                   Adam Jones    
2       Eve     Parker                   Eve Parker    
3      John      Evans                   John Evans    
4      Mary      Davis                   Mary Davis    
5     Peter      Baker                  Peter Baker    
6      Paul    Daniels                 Paul Daniels    
7       ...                                            

Naming data frame variables

patients$age
names(patients) <- c("First_Name", "Second_Name",
                     "Full_Name", "Sex", "Age", 
                     "Weight", "Consent")
names(patients)

Naming data frame variables

patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,
                                         secondName), 
                       Sex = sex,
                       Age = age,
                       Weight = weight, 
                       Consent = consent)
names(patients)

Factors in data frames

patients$First_Name

Factors in data frames

patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,
                                         secondName), 
                       Sex = factor(sex),
                       Age = age,
                       Weight = weight,
                       Consent = consent,
                       stringsAsFactors = FALSE)
patients$Sex
patients$First_Name

Matrices

e <- matrix(1:10, nrow=5, ncol=2)
e
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
rowMeans(e)
[1] 3.5 4.5 5.5 6.5 7.5

Indexing data frames and matrices

e[1,2]
e[1,]
patients[1,2]
patients[1,]

Advanced indexing

s <- letters[1:5]
s
[1] "a" "b" "c" "d" "e"
# View some of the values in s:
s[c(1,3)]
s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]

Advanced indexing

a <- 1:5

# Logical tests:
a < 3
[1]  TRUE  TRUE FALSE FALSE FALSE
s[a < 3]
[1] "a" "b"

Operators

s
[1] "a" "b" "c" "d" "e"
a
[1] 1 2 3 4 5
s[a > 1 & a <3]
s[a == 2]

Exercise: exercise2.Rmd

Logical indexing answers: solution-exercise2.pdf

  1. Patients under 40:
patients[patients$Age < 40, ]
  1. Patients who give consent to share their data:
patients[patients$Consent == TRUE, ]
  1. Men who weigh as much or more than the average European male (70.8 kg):
patients[patients$Sex=="Male" & patients$Weight>=70.8, ]

3. R for data analysis

3 steps to Basic Data Analysis

  1. Reading in data
    • read.table()
    • read.csv(), read.delim()
  2. Analysis
    • Manipulating & reshaping the data
    • Any maths you like
    • Plotting the outcome
  3. Writing out results
    • write.table()
    • write.csv()

A simple walkthrough

The Working Directory (wd)

setwd("/home/participant/Course_Materials")

0. Locate the data

Before we even start the analysis, we need to be sure of where the data are located on our hard drive

getwd()
list.files()

1. Read in the data

rawData <- read.delim("countData.txt")
myfile <- file.choose()
rawData <- read.delim(myfile)
read.csv("countData.csv")
?read.table

1b. Check the data

# View the first 10 rows to ensure import is OK
rawData[1:10,]  
View(rawData)

1c. Understanding the object

class(rawData)
[1] "data.frame"
ncol(rawData)
nrow(rawData)
dim(rawData)
str(rawData)
'data.frame':   50 obs. of  5 variables:
 $ Patient: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Nuclei : int  65 51 37 37 45 46 65 59 49 46 ...
 $ NB_Amp : int  0 3 2 2 2 4 1 1 0 0 ...
 $ NB_Nor : int  63 43 35 35 42 41 64 54 48 45 ...
 $ NB_Del : int  2 5 0 0 1 1 0 4 1 1 ...

1c. Understanding the object

colnames(rawData)
[1] "Patient" "Nuclei"  "NB_Amp"  "NB_Nor"  "NB_Del" 
rawData$Nuclei

Word of caution

Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)

Word of caution

Handling missing values

x <- c(1, NA, 3)
length(x)

Handling missing values

mean(x, na.rm = TRUE)

mean(na.omit(x))

2. Analysis (reshaping data and maths)

# Create an index of results:
prop <- rawData$NB_Amp / rawData$Nuclei
prop > 0.33
# Get sample names of amplified patients:
amp <- which(prop > 0.33) 
amp

2. Analysis (reshaping data and maths)

plot(prop, ylim=c(0,1))
# Add a horizonal line:
abline(h=0.33) 

3. Outputting the results

write.csv(rawData[amp,], file="selectedSamples.csv")
getwd()      # print working directory
list.files() # list files in working directory

Data analysis exercise: exercise3.Rmd

Solution: solution-exercise3.pdf

norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm
 [1]  3  4  7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")

4. Plotting in R

Plot basics

Making a Scatter Plot

patients$Weight
 [1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5 71.5 73.2 64.8
plot(patients$Weight)

Making a Scatter Plot

Making a Scatter Plot of two variables

patients$Age
 [1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Age, patients$Weight)

Making a barplot

barplot(patients$Age)

Making a barplot

barplot(summary(patients$Sex))

Plotting a distribution: Histogram

hist(patients$Weight)

Plotting a distribution: Boxplot

boxplot(patients$Weight, horizontal = TRUE)

Plotting a distribution: Boxplot

boxplot(patients$Weight ~ patients$Sex, horizontal = T)

Exercise: exercise4a.Rmd

  1. Import these data into R
  2. What data types are present? Try to think of ways to create the following plots from the data
    • Scatter plot two variables. e.g. Solar Radiation against Ozone
    • A histogram. e.g. Wind Speed
    • Boxplot of a continuous variable against a categorical variable. e.g. Ozone level per month

Suggestions: solution-exercise4a.pdf

weather <- read.csv("ozone.csv")
View(weather)
Ozone Solar.R Wind Temp Month Day
41 190 7.4 67 5 1
36 118 8.0 72 5 2
12 149 12.6 74 5 3
18 313 11.5 62 5 4
NA NA 14.3 56 5 5
28 NA 14.9 66 5 6

Suggestions

plot(weather$Solar.R, weather$Ozone)

Suggestions

hist(weather$Wind)

Suggestions

boxplot(weather$Ozone)

Suggestions

boxplot(weather$Ozone ~ weather$Month)

Simple customisations

plot(patients$Weight, type = "l")

Simple customisations

plot(patients$Weight, type = "b")

Simple customisations

plot(patients$Age, patients$Weight,
     main = "Relationship between Weight and Age")

Simple customisations

plot(patients$Age, patients$Weight, xlab = "Age")

Simple customisations

plot(patients$Age, patients$Weight, ylab = "Weight")

Simple customisations

plot(patients$Age,patients$Weight,
     ylab="Weight",
     xlab="Age",
     main="Relationship between Weight and Age",
     xlim=c(10,70),
     ylim=c(60,80))

Defining a colour

Use of colours

Changing the col argument to plot() changes the colour that the points are plotted in:

plot(patients$Age, patients$Weight, col = "red")

Plotting characters

plot(patients$Age, patients$Weight, pch = 16)

Plotting characters

Plotting characters

plot(patients$Age, patients$Weight, pch = "X")

Size of points

Character expansion:

plot(patients$Age, patients$Weight, cex = 3)

Size of points

Character expansion:

plot(patients$Age, patients$Weight, cex = 0.2)

Colours and characters as vectors

plot(patients$Age, patients$Weight, 
     pch = 1:10, cex = 1:5,
     col = c("red", "orange", "green", "blue"))

Other plots use the same arguments

boxplot(patients$Weight~patients$Sex,
        xlab = "Sex",
        ylab = "Weight",
        main = "Relationship between Weight and Gender",
        col  = c("blue","yellow"))

Exercise: exercise4b.Rmd

Solutions: solution-exercise4b.pdf

plot(weather$Solar.R, weather$Ozone, col="orange", pch=16,
     ylab="Ozone level", xlab="Solar Radiation", 
     main="Relationship between ozone level and 
     solar radiation")

Solutions

hist(weather$Wind, col="purple", xlab="Wind Speed",
     main="Distribution of Wind Speed", breaks = 20, 
     freq=FALSE)

Solutions

boxplot(weather$Ozone~weather$Month,col=rainbow(5),
        names=c("May", "Jun", "Jul", "Aug","Sep"),
        las=2,lab="Ozone Level",
        main="Distribution of Ozone per-month")

Solutions

library(RColorBrewer)
display.brewer.all()
display.brewer.all(colorblindFriendly = TRUE)

Solutions

boxplot(weather$Ozone ~ weather$Month, 
        col=brewer.pal(5,"Set1"))

End of Day 1

To come tomorrow…